Hidden Layer Training via Hessian Matrix Information
Abstract
The output weight optimization-hidden weight optimization (OWO-HWO) algorithm for training the multilayer perceptron alternately updates the output weights and the hidden weights. This layer-by-layer training strategy greatly improves convergence speed. However, in HWO the desired net function actually evolves in the gradient direction, which inevitably reduces efficiency. In this paper, two improvements to the OWO-HWO algorithm are presented. New desired net functions are proposed for hidden layer training, which use Hessian matrix information rather than gradients. A weighted hidden layer error function, which takes saturation into consideration, is derived directly from the global error function. Both techniques greatly increase training speed. Faster convergence is verified by simulations with remote sensing data sets.

Introduction

The multi-layer perceptron (MLP) is widely used in signal processing, remote sensing, and pattern recognition. Since back propagation (BP) was first proposed for MLP training (Werbos 1974), many researchers have attempted to improve its convergence speed. Techniques used to improve convergence include second order information (Battiti 1992, Mollor 1997), training the network layer by layer (Wang and Chen 1996, Parisi et al. 1996), avoiding saturation (Yam and Chow 2000, Lee, Chen and Huang 2001), and adapting the learning factor (Magoulas, Vrahatis and Androulakis 1999, Nachtsheim 1994). Algorithms such as Qprop (Fahlman 1989), conjugate gradient (Fletch 1987, Kim 2003), and Levenberg-Marquardt (LM) (Hagan and Menhaj 1994, Fletch 1987) often perform much better than BP. The essential difference among these algorithms is the weight-updating strategy: convergence speed varies considerably depending on whether the weights are modified in the gradient direction, a conjugate direction, or the Newton direction. Which direction is better depends on the nature of the application, the computational load, and other factors. Generally, gradient methods perform worst, while the Newton method performs best but requires more computation time.

Chen (Chen, Manry and Chandrasekaran 1999) constructed a batch-mode training algorithm called output weight optimization-hidden weight optimization (OWO-HWO). In OWO-HWO, output weights and hidden unit weights are alternately modified to reduce the training error. The algorithm modifies the hidden weights based on minimizing the MSE between the desired and the actual net functions, as originally proposed by Scalero and Tepedelenlioglu (Scalero and Tepedelenlioglu 1992). Although OWO-HWO greatly increases training speed, it still has room for improvement (Wang and Chen 1996), because it uses the delta function, which is just gradient information, as the desired net function change. In addition, HWO is equivalent to BP applied to the hidden weights under certain conditions (Chen, Manry and Chandrasekaran 1999).

In this paper, a Newton-like method is used to improve hidden layer training. First, we review OWO-HWO training. Then, we propose new desired hidden layer net function changes using Hessian matrix information. Next, we derive a weighted hidden layer error function from the global training error function, which de-emphasizes error in saturated hidden units. Finally, we compare the improved training algorithm with the original OWO-HWO and LM algorithms via simulations on three remote sensing training data sets.
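As a rough illustration of the alternation described above, the sketch below runs one OWO-HWO style epoch for a three-layer MLP with linear outputs and input-to-output bypass weights. The function and variable names are illustrative, and the HWO step is reduced to a plain gradient-style update of the hidden weights; the actual algorithm solves linear equations for the hidden weight changes, so this is a sketch of the idea, not the authors' implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def owo_hwo_epoch(X, T, W_hid, W_out, lr=0.1):
    """One alternating epoch (illustrative only).
    X: (Nv, N+1) inputs with a trailing 1 per row for the thresholds.
    T: (Nv, M) desired outputs.
    W_hid: (Nh, N+1) hidden weights.  W_out: (M, N+1+Nh) output weights."""
    net = X @ W_hid.T                     # hidden net functions, (Nv, Nh)
    O = sigmoid(net)                      # hidden unit activations
    Z = np.hstack([X, O])                 # bypass inputs + hidden outputs

    # OWO step: with linear output units, the output weights are the
    # solution of a linear least-squares problem.
    W_out = np.linalg.lstsq(Z, T, rcond=None)[0].T

    # HWO step (original form): the desired net function change of each
    # hidden unit is the delta function, i.e. the negative gradient of the
    # global MSE with respect to that unit's net function (up to a constant).
    Y = Z @ W_out.T                       # network outputs, (Nv, M)
    N1 = X.shape[1]                       # N + 1
    delta = ((T - Y) @ W_out[:, N1:]) * O * (1.0 - O)

    # Move the hidden weights toward the desired net function change
    # (plain gradient-style step here, for brevity).
    W_hid = W_hid + lr * (delta.T @ X) / X.shape[0]
    return W_hid, W_out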
The OWO-HWO Algorithm

Without loss of generality, we restrict our discussion to a three-layer fully connected MLP with linear output activation functions. First, we describe the network structure and our notation. Then we review the OWO-HWO algorithm for our MLP.

Fully Connected MLP Notation

The network structure is shown in Fig. 1. For clarity, the bypass weights from the input layer to the output layer are not shown. The training data set consists of Nv training patterns {(xp, tp)}, where the pth input vector xp and the pth desired output vector tp have dimensions N and M, respectively. Thresholds in the hidden and output layers are handled by letting xp,(N+1) = 1. For the jth hidden unit, the net input netpj and the output activation Opj for the pth training pattern are

netpj = Σi w(j,i) xp,i (summed over i = 1, …, N+1),   Opj = f(netpj),

where w(j,i) denotes the weight connecting the ith input to the jth hidden unit and f(·) is the hidden unit activation function.

Fig. 1. The network structure.
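For concreteness, here is a minimal sketch of the hidden layer forward computation under the notation above, assuming sigmoid hidden activations (the activation f is not specified in this excerpt) and a weight layout w_hid[j, i] holding the weight from the ith augmented input to the jth hidden unit; both choices are illustrative assumptions.

import numpy as np

def hidden_forward(x_p, w_hid):
    # x_p: length-N input pattern; w_hid: (Nh, N+1) hidden weight matrix.
    x_aug = np.append(x_p, 1.0)          # xp,(N+1) = 1 handles the threshold
    net_p = w_hid @ x_aug                # netpj = sum over i of w(j,i) * xp,i
    O_p = 1.0 / (1.0 + np.exp(-net_p))   # Opj = f(netpj), sigmoid assumed
    return net_p, O_p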
Similar Resources
Edge Detection with Hessian Matrix Property Based on Wavelet Transform
In this paper, we present an edge detection method based on the wavelet transform and the Hessian matrix of the image at each pixel. Many methods based on the wavelet transform use it to approximate the gradient of the image and detect edges by searching for the modulus maxima of the gradient vectors. In our scheme, we also use the wavelet transform to approximate the Hessian matrix of the image at each pixel. ...
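As a loose illustration of a per-pixel Hessian-based edge response, the sketch below uses Gaussian derivative filters in place of the wavelet-based approximation described in this excerpt; the scale parameter and the response rule are illustrative choices, not the cited method.

import numpy as np
from scipy.ndimage import gaussian_filter

def image_hessian(img, sigma=2.0):
    # Per-pixel second derivatives of a 2-D image (x = column axis),
    # estimated with Gaussian derivative filters at scale sigma.
    Ixx = gaussian_filter(img, sigma, order=(0, 2))
    Iyy = gaussian_filter(img, sigma, order=(2, 0))
    Ixy = gaussian_filter(img, sigma, order=(1, 1))
    return Ixx, Ixy, Iyy

def hessian_edge_strength(img, sigma=2.0):
    # Response = largest-magnitude eigenvalue of the 2x2 Hessian at each pixel.
    Ixx, Ixy, Iyy = image_hessian(img, sigma)
    half_trace = (Ixx + Iyy) / 2.0
    root = np.sqrt(((Ixx - Iyy) / 2.0) ** 2 + Ixy ** 2)
    return np.maximum(np.abs(half_trace + root), np.abs(half_trace - root))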
Multiple optimal learning factors for feed-forward networks
A batch training algorithm for feed-forward networks is proposed which uses Newton's method to estimate a vector of optimal learning factors, one for each hidden unit. Backpropagation, using this learning factor vector, is used to modify the hidden units' input weights. Linear equations are then solved for the network's output weights. Elements of the new method's Gauss-Newton Hessian matrix ar...
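A rough sketch of the idea of one Newton-estimated learning factor per hidden unit follows; it estimates the error derivatives numerically rather than using the Gauss-Newton expressions mentioned in the excerpt, and all names and the sigmoid/linear-output structure are assumptions made for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mse_after_step(z, X, T, W_hid, W_out, D_hid):
    # Training MSE after each hidden unit j moves along its backprop
    # direction D_hid[j] scaled by its own learning factor z[j].
    W_new = W_hid + z[:, None] * D_hid
    O = sigmoid(X @ W_new.T)
    Y = np.hstack([X, O]) @ W_out.T
    return np.mean((T - Y) ** 2)

def newton_learning_factors(X, T, W_hid, W_out, D_hid, eps=1e-4):
    # One Newton step for the learning factor vector z, with the gradient
    # and Hessian of E(z) estimated by central finite differences at z = 0.
    Nh = W_hid.shape[0]
    E = lambda z: mse_after_step(z, X, T, W_hid, W_out, D_hid)
    z0 = np.zeros(Nh)
    g = np.zeros(Nh)
    H = np.zeros((Nh, Nh))
    for j in range(Nh):
        ej = np.eye(Nh)[j] * eps
        g[j] = (E(z0 + ej) - E(z0 - ej)) / (2 * eps)
        for k in range(j, Nh):
            ek = np.eye(Nh)[k] * eps
            H[j, k] = (E(z0 + ej + ek) - E(z0 + ej - ek)
                       - E(z0 - ej + ek) + E(z0 - ej - ek)) / (4 * eps ** 2)
            H[k, j] = H[j, k]
    # Newton step: solve H z = -g (small ridge term for numerical safety).
    return np.linalg.solve(H + 1e-8 * np.eye(Nh), -g)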
Towards a Mathematical Understanding of the Difficulty in Learning with Feedforward Neural Networks
Despite the recent success of deep neural networks in various applications, designing and training deep neural networks is still among the greatest challenges in the field. In this work, we address the challenge of designing and training feedforward Multilayer Perceptrons (MLPs) from a smooth optimisation perspective. By characterising the critical point conditions of an MLP based loss function...
Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian-Vector Multiply
The online incremental gradient (or backpropagation) algorithm is widely considered to be the fastest method for solving large-scale neural-network (NN) learning problems. In contrast, we show that an appropriately implemented iterative batch-mode (or block-mode) learning method can be much faster. For example, it is three times faster in the UCI letter classification problem (26 outputs, 16,00...
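A minimal sketch of a Hessian-vector product that never forms the Hessian explicitly is given below; a central finite difference of the gradient stands in for the exact Pearlmutter operator named in this excerpt, and grad_fn is a placeholder for any function that returns the loss gradient at a weight vector.

import numpy as np

def hessian_vector_product(grad_fn, w, v, r=1e-5):
    # Approximate H(w) @ v as (grad(w + r*v) - grad(w - r*v)) / (2*r),
    # so Krylov-style solvers can use curvature without storing H.
    return (grad_fn(w + r * v) - grad_fn(w - r * v)) / (2.0 * r)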
Improved Neural Network Initialization by Grouping Context-Dependent Targets for Acoustic Modeling
Neural Network (NN) Acoustic Models (AMs) are usually trained using context-dependent Hidden Markov Model (CDHMM) states as independent targets. For example, the CDHMM states of A-b-2 (second variant of beginning state of A) and A-m-1 (first variant of middle state of A) both correspond to the phone A, and A-b-1 and A-b-2 both correspond to the Context-independent HMM (CI-HMM) state A-b, but th...
Publication date: 2004